Review Topic Discovery with Phrases using the Pólya Urn Model
Authors
Abstract
Topic modelling is widely used to discover latent topics in text documents. Most existing models operate on individual words; that is, they treat each topic as a distribution over words. However, using only individual words has several shortcomings. First, it inflates word co-occurrence counts, often incorrectly, because a two-word phrase is not equivalent to two separate words. These extra and often incorrect co-occurrences lead to poorer output topics; a multi-word phrase should be treated as a single term. Second, individual words are often difficult to use in practice because the meaning of a word within a phrase can differ considerably from its meaning in isolation. Third, topics presented as lists of individual words are difficult to understand for users who are not domain experts and have no knowledge of topic models. In this paper, we aim to solve these problems by considering phrases in their natural form. One simple way to include phrases in topic modelling is to treat each phrase as a single term. However, this method is not ideal: the meaning of a phrase is often related to its component words, and that information is lost. This paper proposes using the generalized Pólya urn (GPU) model to solve the problem, which gives superior results. The GPU model naturally connects a phrase with its content words. Our experimental results on 32 review datasets show that the proposed approach is highly effective.
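To make the idea of connecting a phrase with its content words concrete, the following is a minimal sketch of a generalized Pólya urn style count update, not the authors' exact scheme. It assumes that each topic keeps an urn of term counts, and that drawing a multi-word phrase returns the phrase itself plus a fractional amount of each component word. The function name `gpu_update` and the `promotion` weight of 0.5 are hypothetical choices made for illustration.

```python
from collections import defaultdict


def gpu_update(topic_counts, term, delta, promotion=0.5):
    """Apply one urn update for `term` in a single topic.

    delta is +1 when the term is assigned to the topic and -1 when
    the assignment is removed (for example, during Gibbs sampling).
    `promotion` is a hypothetical weight: the fraction of a count
    that each component word of a phrase receives.
    """
    # The term itself always changes by a full count.
    topic_counts[term] += delta

    # A multi-word phrase also promotes its component words,
    # so the phrase and its words stay linked in the same topic.
    words = term.split()
    if len(words) > 1:
        for word in words:
            topic_counts[word] += delta * promotion


# Hypothetical usage: one topic's urn, keyed by term.
counts = defaultdict(float)
gpu_update(counts, "battery life", +1)  # phrase treated as one term
gpu_update(counts, "battery", +1)       # plain single word
```

With `promotion` set to 0, this reduces to treating each phrase as an isolated single term, which is the simple approach the abstract describes as losing the link to the component words.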
Similar resources
Coupon collector’s problems with statistical applications to rankings
Some new exact distributions on coupon collector’s waiting time problems are given based on a generalized Pólya urn sampling. In particular, usual Pólya urn sampling generates an exchangeable random sequence. In this case, an alternative derivation of the distribution is also obtained from de Finetti’s theorem. In coupon collector’s waiting time problems with m kinds of coupons, the observed or...
On Generalized Pólya Urn Models
We study an urn model introduced in the paper of Chen and Wei [2], where at each discrete time step m balls are drawn at random from the urn containing colors white and black. Balls are added to the urn according to the inspected colors, generalizing the well known Pólya-Eggenberger urn model, case m = 1. We provide exact expressions for the expectation and the variance of the number of white b...
Run Statistics Defined on the Multicolor Urn Model
Recently, Makri, Philippou and Psillakis (2007b) studied the exact distribution of success run statistics defined on an urn model. They derived the exact distributions of various success run statistics for a sequence of binary trials generated by the Pólya–Eggenberger sampling scheme. In our study we derive the joint distributions of run statistics defined on the multicolor urn model using a si...
On Sampling without Replacement and Ok-corral Urn Models
In this work we discuss two urn models with general weight sequences (A, B) associated to them, $A = (\alpha_n)_{n \in \mathbb{N}}$ and $B = (\beta_m)_{m \in \mathbb{N}}$, generalizing two well known Pólya-Eggenberger urn models, namely the so-called sampling without replacement urn model and the OK Corral urn model. We derive simple explicit expressions for the distribution of the number of white balls, when all black have been drawn, and ob...
TSDPMM: Incorporating Prior Topic Knowledge into Dirichlet Process Mixture Models for Text Clustering
The Dirichlet process mixture model (DPMM) has great potential for detecting the underlying structure of data. Extensive studies have applied it to text clustering in terms of topics. However, due to its unsupervised nature, the topic clusters are often less than satisfactory. Considering that people often have some prior knowledge about which potential topics should exist in given data, we aim to inc...